Important Note: Grading is based both on your graphs and verbal explanations. Follow all best practices as discussed in class, including choosing appropriate parameters for all graphs, and appropriate color choices. There’s one exception: only one of your graphs needs to be drawn with color vision deficient friendly colors. Choose one, and show that that graph passes the test by including a screenshot taken with a color vision deficiency simulator such as Color Oracle (http://colororacle.org/).

All graphs must be drawn in R. I have provided package suggestions but you may choose other options.

You are expected to develop a basic subject matter understanding of the data, looking up, for example, whether a variable is ordinal or nominal, what the units are, etc. as relevant.

Do not expect the assignment questions to spell out precisely how the graphs should be drawn. Sometimes guidance will be provided, but the absense of guidance does not mean that all choices are ok.

Read Graphical Data Analysis with R, Ch. 6, 7

library(EDAWR)
library(GGally)
library(parcoords)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(ggalluvial)
library(vcd)
library(ggalluvial)
library(tidyr)

1. Nutrition

[6 points]

Data: nutrition dataset in EDAWR package. Install with:

remotes::install_github(rstudio/EDAWR)

Package suggestions: GGally, parcoords

  1. Choose 3 food groups (group column) and at least 8 numeric variables and create a static parallel coordinates plot in which each line represents a food item. Color by food group. You may random sample if you find there is too much data to reasonably display. Choose parameters that best show trends in the data: experiment with alpha blending, scale, order of columns, splines (if available), and size of sample.
data_new <- nutrition %>% select(c("zinc","iron","calories","potassium","protein","niacin","magnesium","carbohydrates","group"))
data_new <- data_new[sample(1:nrow(data_new),3000),]
data_new <- filter(data_new, group %in% c("Poultry Products","Cereal Grains and Pasta","Beef Products"))

ggparcoord(data_new,columns = 1:8 ,groupColumn = "group", scale = "uniminmax", alphaLines = 0.3, splineFactor = 0, order = "skewness") +
  theme(legend.position = "none")

#为啥要把legend隐藏
  1. Repeat part a) with the same data but this time create an interactive parallel coodinates plot.
data2 = rbind(data_new[3,],data_new[-3,])
parcoords(
   data2,
   rownames = FALSE
    ,brushMode = "1d-multi"
    ,reorderable = TRUE,
   color=list(colorBy = "group"),
     withD3 = TRUE
 )
#我改了一下颜色,为了第三问能方便看一点
  1. Describe clusters, correlations, outliers and anything else of relevance in your plots in detail.

2. Electric Cars Rebate

[9 points]

Data: NYSERDA Electric Vehicle Drive Clean Rebate Data: Beginning 2017

Package suggestions:

Mosaic plots: vcd or ggmosaic

Treemap: treemap

From the NYS web site: “New York State’s Charge NY initiative offers electric car buyers the Drive Clean Rebate of up to $2,000 for new car purchases or leases. The rebate amount depends on the battery-only range of each vehicle. Dealers enrolled in the program deduct the eligible amount from the vehicle price at the point of sale and then submit a rebate application with NYSERDA. This dataset includes all completed rebate applications as of the data through date.”

https://data.ny.gov/Energy-Environment/NYSERDA-Electric-Vehicle-Drive-Clean-Rebate-Data-B/thd2-fu8y

You can read the data directly from the site with:

read_csv("https://data.ny.gov/resource/thd2-fu8y.csv")

Draw the following graphs and answer the questions.

  1. Is there an association between transaction type and rebate amount? Justify your answer with a mosaic plot, treating rebate amount as the dependent variable.
data2<-read_csv("https://data.ny.gov/resource/thd2-fu8y.csv")
mosaic(rebate_amount_usd~transaction_type,data2)

There is an association between transaction type and rebate amount because the distributions of rebate amount are obviously different with different types of transaction.

  1. Perform a chi square test to test for association between transaction type and rebate amount. What is the result? Is it the same as in part a)?
test <- chisq.test(table(data2$transaction_type, data2$rebate_amount_usd))
test
## 
##  Pearson's Chi-squared test
## 
## data:  table(data2$transaction_type, data2$rebate_amount_usd)
## X-squared = 69.463, df = 3, p-value = 5.562e-15

The p-value of chi square test is too small to accept the null hypothesis, and so we will assume type of transaction and rebate amount are related. It is small with result in (a).

  1. Draw a mosaic plot of the top five car makes by year in this dataset. Is there an association?
top5 = arrange(plyr::count(data2, "make"),desc(freq))[1:5,1]
top5data = data2[data2$make %in% top5,]
top5data = mutate(top5data, year = format(as.Date(top5data$submitted_date,format="%Y-%m-%d"), "%Y"))
df2c = top5data %>% group_by(make,year) %>% dplyr::summarise(Freq = n())
vcd::mosaic(year ~ make,direction = c("v","h"),labeling = labeling_border(rot_labels = c(0,0,0,45),abbreviate_labs=c(3,2)),top5data)

d) Add transaction_type to your plot from part c). What new information do you learn?

mosaic(year ~ make+transaction_type,direction = c("v","v","h"),labeling = labeling_border(rot_labels = c(0,0,0,45),abbreviate_labs=c(3,1,2)),top5data)

e) Use the base pairs() function to draw a mosaic pairs plot of ev_type, transaction_type, rebate_amount_usd and year. Based on the plot, list all pairs of variables from strongest association to weakest association. (Note: The vcd package must be loaded for pairs() to find the correct method.)

library(vcd)
data2_e = mutate(data2, year = str_sub(data2$submitted_date,1,4))[,c(7,8,11,12)]
pairs(xtabs( ~ ev_type+rebate_amount_usd+transaction_type+year, data = data2_e))

  1. Draw a treemap of the make and models of the cars in this dataset.
library(treemap)
data2_f<-group_by (data2,make,model)%>% dplyr::summarise(count=n(),.groups = 'drop')
data2_test<-group_by(data2,make)%>%dplyr::summarise(count=n())
treemap(data2_f,index=c("make","model"), vSize="count",vColor = "count",title='treemap of make and models',palette='Set3')

3. Occupational Mobility

[6 points]

Data: Yamaguchi87 in the vcdExtra package

Package suggestion: ggalluvial

  1. Draw an alluvial diagram showing upward / downward mobility by class, combining data from all three countries together.
library(vcdExtra)
library(ggalluvial)
df3 = Yamaguchi87
ggplot(df3,aes(y=Freq,axis1 = Father, axis2 = Son))+
  geom_alluvium(aes(fill = Father),width = 1/12)+
  geom_stratum(width = 1/12,fill = "grey")+
  geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
  #geom_lode(stat = "stratum",aes(label = after_stat(stratum)))+
  scale_x_discrete(expand = c(.05, .05))+
  scale_fill_brewer(type = "qual", palette = "Set1")+ggtitle("upward / downward mobility by class")+
  theme_minimal()

  1. Incorporate the country information into your plot in a way that best illustrates the trends and facilitates comparison.
ggplot(df3,aes(y=Freq,axis1 = Father, axis2 = Country,axis3 = Son))+
  geom_flow(aes(fill = Father),width = 1/12)+
  geom_stratum(width = 1/12,fill = "grey")+
  geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
  scale_x_discrete(expand = c(.05, .05))+
  scale_fill_brewer(type = "qual", palette = "Set1")

  1. Based on part b), compare occupational mobility in the U.S., U.K., and Japan.
df3_US <-filter(df3,Country=="US")
df3_UK <-filter(df3,Country=="UK")
df3_JA <-filter(df3,Country=="Japan")
plotUS<-ggplot(df3_US,aes(y=Freq,axis1 = Father, axis2 = Son))+
  geom_alluvium(aes(fill = Father),width = 1/12)+
  geom_stratum(width = 1/12,fill = "grey")+
  geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
  #geom_lode(stat = "stratum",aes(label = after_stat(stratum)))+
  scale_x_discrete(expand = c(.05, .05))+
  scale_fill_brewer(type = "qual", palette = "Set1")+ggtitle("upward / downward mobility by class(US)")+
  theme_minimal()
plotUK<-ggplot(df3_UK,aes(y=Freq,axis1 = Father, axis2 = Son))+
  geom_alluvium(aes(fill = Father),width = 1/12)+
  geom_stratum(width = 1/12,fill = "grey")+
  geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
  #geom_lode(stat = "stratum",aes(label = after_stat(stratum)))+
  scale_x_discrete(expand = c(.05, .05))+
  scale_fill_brewer(type = "qual", palette = "Set1")+ggtitle("upward / downward mobility by class(UK)")+
  theme_minimal()
plotJA<-ggplot(df3_JA,aes(y=Freq,axis1 = Father, axis2 = Son))+
  geom_alluvium(aes(fill = Father),width = 1/12)+
  geom_stratum(width = 1/12,fill = "grey")+
  geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
  #geom_lode(stat = "stratum",aes(label = after_stat(stratum)))+
  scale_x_discrete(expand = c(.05, .05))+
  scale_fill_brewer(type = "qual", palette = "Set1")+ggtitle("upward / downward mobility by class(Japan)")+
  theme_minimal()
#library("Rmisc")
#library("plyr")
#multiplot(plotUS,plotUK,plotJA, cols=1)
plotUS

plotUK

plotJA

### 4. State Credit Ratings

[9 points]

Data: state_ratings.csv in the data folder of CourseWorks

Package suggestion: ggalluvial

The original data source is here: https://www.spglobal.com/ratings/en/research/articles/190319-history-of-u-s-state-ratings-2185306

The issuer credit rating (ICR) indication, distinguishes some ratings from general obligation debt ratings. A footnote on this site explains that a state’s issue credit rating is listed in place of a general obligation bond rating when a state does not have general obligation debt outstanding. To be consistent with sites like this, and for the purposes of simplification, the ICR indication was removed.

The dates were converted to years, and in cases in which there were more than one entry per year, the latest was selected. For years in which there were no credit rating changes, the latest credit rating available was added. So for example, Wyoming had a listing of AA+ for 2017 which was reduced to AA in 2020. After filling in the inbetween years, we have the following for Wyoming from 2017-2020:

df4 <- readr::read_csv("state_ratings.csv")
#dplyr::filter(df, State == "Wyoming" & Year >= 2017)
data4 <- readr::read_csv("state_ratings.csv")

If you’re interested, the script to read and transform the data is available here: get_ratings_data.R

  1. Create an alluvial diagram showing rating changes by state over time. You do not have to use all the data. If you choose not to, explain your rationale for subsetting or sampling.
df4a = df4[,c(1,2,5)]
df4a = df4a %>% pivot_wider(names_from = Year, values_from = Rating) %>% data.frame()
df4a = df4a %>% dplyr::mutate(Freq = n())
df4a = df4a %>% filter(State %in% c("Arizona","California","Georgia","Florida","Illinois"," 
New York","Washington","Kentucky","Massachusetts","Michigan","Nevada","Ohio"))
ggplot(df4a,aes(y=Freq,axis1 = X2006, axis2 = X2007,axis3 = X2008,axis4 = X2009,axis5 = X2010,axis6 = X2011,axis7 = X2012,axis8 = X2013,axis9 = X2014,axis10 = X2015,axis11 = X2016))+
  geom_alluvium(aes(fill = State),width = 1/12)+
  geom_stratum(width = 1/12,fill = "grey")+
  geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
  scale_x_discrete(expand = c(.05, .05))

state_list<-c("Alaska","California","Connecticut" ,"Illinois","Louisiana","New Jersey","Oregon" ,"Pennsylvania","South Dakota","Texas","Wyoming")
df4<-data4
df4a = df4[,c(1,2,5)]
df4a = df4a %>% pivot_wider(names_from = Year, values_from = Rating) %>% data.frame()
df4a = df4a %>% dplyr::mutate(Freq = n())
df4a = df4a %>% filter(State %in% state_list)
ggplot(df4a,aes(y=Freq,axis1 = X2006, axis2 = X2007,axis3 = X2008,axis4 = X2009,axis5 = X2010,axis6 = X2011,axis7 = X2012,axis8 = X2013,axis9 = X2014,axis10 = X2015,axis11 = X2016))+
  geom_alluvium(aes(fill = State),width = 1/12)+
  geom_stratum(width = 1/12,fill = "grey")+
  geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
  scale_x_discrete(expand = c(.05, .05))

b) Create a heatmap with the same data (if applicable use the same subset or sample), also showing rating changes by state over time. (One option: geom_tile().)

df4 %>% filter(State %in% c("Arizona","California","Georgia","Florida","Illinois"," 
New York","Washington","Kentucky","Massachusetts","Michigan","Nevada","Ohio")) %>% ggplot(aes(x=Year, y=State, fill=Rating))+
  geom_tile()

# Heatmap2 
df4 %>% filter(State %in% state_list) %>% ggplot(aes(x=Year, y=State, fill=Rating))+
  geom_tile()

  1. Compare a) and b). What advantages does each have? Which do you prefer?

In heatmap, the ratings of each year for each state are on single line. The color changes alone with rating which is greatly obvious compared to alluvial diagram. Since the rating set and proportion differ for each year, there may be a curve in alluvial diagram even the value doesn’t change by year.

In alluvial diagram, the magnitude of the change are more likely to know. Since rating is a hierarchical value,more than if changes existing, we are also curious about level of changes. The alluvial diagram can reflect degree of change by curvature(length) of flow while heatmap not.